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Abstract 

Many modern statistical applications involve infer- 
ence for complicated stochastic models for which the 
likelihood function is difficult or even impossible to 
calculate, and hence conventional likelihood-based in- 
ferential techniques cannot be used. In such settings, 
Bayesian inference can be performed using Approx- 
imate Bayesian Computation (ABC). However, in 
spite of many recent developments to ABC method- 
ology, in many applications the computational cost 
of ABC necessitates the choice of summary statistics 
and tolerances that can potentially severely bias the 
estimate of the posterior. 

We propose a new "piecewise" ABC approach suit- 



able for discretely observed Markov models that in- 
volves writing the posterior density of the parameters 
as a prodiict of factors, each a function of only a sub- 
set of the data, and then using ABC within each fac- 
tor. The approach has the advantage of side-stepping 
the need to choose a summary statistic and it enables 
a stringent tolerance to be set, making the posterior 
"less approximate". We investigate two methods for 
estimating the posterior density based on ABC sam- 
ples for each of the factors: the first is to use a Gaus- 
sian approximation for each factor, and the second is 
to use a kernel density estimate. Both methods have 
their merits. The Gaussian approximation is simple, 
fast, and probably adequate for many applications. 
On the other hand, using instead a kernel density es- 
timate has the benefit of consistently estimating the 
true ABC posterior as the number of ABC samples 



tends to infinity. We illustrate the piecewise ABC ap- expressed via the density Tr{0). The following Algo- 

proach for three examples; in each case, the approach rithm 1 generates exact samples from the Bayesian 

enables "exact matching" between simulations and posterior density '!t{9\X) which is proportional to 

data and offers fast and accurate inference. 7t{X\6)tt{6): 



1 Introduction 



Stochastic models are commonly used to model pro- 



cesses in the physical sciences Wilkinson 2011 



Van Kampen 2007| . For many such models the like- 
lihood is difhcult or costly to compute making it in- 
feasible to use conventional inference techniques such 
as maximum likelihood estimation. However, pro- 
vided it is possible to simulate from a model, then 
"implicit" methods such as Approximate Bayesian 
Computation (ABC) methods enable inference with- 
out having to calculate the likelihood. These methods 
were originally developed for applications in popula- 
tion genetics 



Algorithm 1 

Exact Bayesian Computation (EBC) 
1: Sample 0* from tt{0). 

2: Simulate dataset X* from the model using param- 
eters 0*. 

3: Accept 0* li X* — X, otherwise reject. 
4: Repeat. 

This algorithm is only of practical use if X{t) is 
discrete, else the acceptance probability in Step 3 is 
zero. For continuous distributions, or discrete ones 
in which the acceptance probability in step 3 is un- 



Pritchard et al. 1999 and human de 



mographics Beaumont et al. 2002 , but are now be 



ing used in a wide range of fields including epidemiol- 



ogy McKinley et al. , 2009 , evolution of species Toni 



et al. 2009 , finance Dean et al. 2011 , and evolution 



of pathogens Gabriel et al. 2010 , to name a few. 



Intuitively, ABC methods involve simulating data 
from the model using various parameter values and 
making inference based on which parameter values 
produced realisations that are "close" to the observed 
data. Suppose that our data are a set of observa- 
tions denoted X = {xi, . . . ,x„} = {x{ti), . . . ,x{tn)} 
of state variable X{t) e M™ at time points ii, . . . , i„. 
We assume that the data arise from a Markov 
stochastic model (this encompasses IID data as a 
special case) parameterised by the vector 0, which 
is the target of inference. Prior beliefs about are 



acceptably low, Pritchard et al. 1999 suggested the 
following algorithm: 

Algorithm 2 

Approximate Bayesian Computation (ABC) 

As Algorithm 1, but with step 3 replaced by: 

3': Accept 0* if d{s{X), s{X*)) < e, otherwise reject. 



where d(-, ■) is a distance function, usually taken to 
be the L2-norm of the difference between its argu- 
ments; s(-) is a function of the data; and e is a tol- 
erance. Note that s(-) can be the identity function 
but in practise, to give tolerable acceptance rate, it 
is usually taken to be a lower dimensional vector of 
summary statistics that characterise key aspects of 
the data. 

The output of the ABC algorithm is a sam- 
ple from the ABC posterior density Tr{6\X) = 



2 



7T{9\d{s{X),s{X*)) < e). Provided s(-) is sufficient 
for 6, then the ABC posterior density converges to 
Tr{9\X) as £ — ?> 0. However, in practise it is rarely 
possible to use an s(-) which is sufficient, or to take 
e especially small (or zero). Hence ABC requires a 
careful choice of s(-) and e to try to make the accep- 
tance rate tolerably large, at the same time as trying 
not to make the ABC posterior too different from 
the true posterior, Tr{d\X). In other words, there is 
a balance which involves trading off Monte Carlo er- 
ror with "ABC error" owing to the choice of s(-) and 
tolerance e. 

Over the last decade, a wide range of extensions to 
the original ABC algorithm have been developed, in- 
cluding Markov Chain Monte Carlo (MCMC) [Marjo 



ram et al. 2003 and Sequential Monte Carlo (SMC) 



Toni et al. 2009 implementations, the incorpora- 



tion of auxiliary regression models [Beaumont et al. 



2002 Blum and Franois 2010 , and (semi-)automatic 



choice of summary statistics Fearnhead and Pran- 
see 



gle 2012 



Marin et al. 2011 for a review. In 



all of these ABC variants, however, computational 
cost remains a central issue, since it is the compu- 
tational cost that determines the balance that can 
be made between controlling Monte Carlo error and 
controlling bias arising from using summary statistics 
and/or non-zero tolerance. 

In this paper we propose a novel algorithm called 
piecewise ABC (PW-ABC), the aim of which is to 
substantially reduce the computational cost of ABC. 
The algorithm is applicable to a particular (but fairly 
broad) class of models, namely those with the Markov 
property and for which the state variable is observ- 
able at discrete time points. The algorithm is based 
on a factorisation of the posterior density such that 



each factor corresponds to only a subset of the data. 
The idea is to apply the ABC algorithm for each fac- 
tor (a task which is computationally very cheap), 
to compute the density estimates for each factor, 
and then to estimate the full posterior density as 
the product of these factors. Taking advantage of 
the factorisation lowers the computational burden of 
ABC such that the choice of summary statistic and 
tolerance — and the accompanying biases — can poten- 
tially be avoided completely. 

In the following section we describe PW-ABC in 
more detail. The main question of the method is how 
to use the ABC samples from each posterior factor 
to estimate the full posterior density. We discuss two 
approaches to estimating the relevant densities and 
products of densities, then we apply PW-ABC, using 
both approaches, to three examples: a toy illustra- 
tive example of inferring the probability of success in 
a binomial experiment, an autoregressive time-series 
model, and a dynamical predator-prey model. We 
conclude with a discussion of the strengths and lim- 
itations of PW-ABC, and of potential further gener- 
alisations. 

2 Piece-wise ABC (PW-ABC) 

Our starting point is to use the Markov property to 
write the likelihood as 

9) Tr{xi\9) 



[]7r(x,|x,_i,0)U(a;i|0). 



(1) 



\i=2 



The likelihood contribution of the first observation 
xi is asymptotically irrelevant as the number of ob- 
servations, n, increases and, henceforth, to keep the 



3 



presentation simple, we ignore the term 7t(xi\9) in where 
Q. Accounting for this, and by using multiple ap- 
plications of Bayes' theorem, the posterior density 
can be written in the following factorised form. 



TT{e\x) cx TT{x\e)TT{e) 



oc 



^jj 7r(.Tj|xj-i,6l)7r(6l) j ^^^^ 



(2) 



where 



Ci ^ J n{xi\xi^i,8)7r{9)dd. 

Essentially, in ([2| the posterior density, t:{9\X), of 
9 given the full data X has been decomposed into a 
product involving densities f i{9), each of which de- 
pends only on a pair of data points, {xi-.i,Xi}. 

The key idea now is to use ABC to draw 
approximate samples from each of the densities 
ipi{9). Applying Algorithm [2] involves (i) draw- 
ing 9* from 7r(6'), (ii) simulating x*\xi^i,9* , and 
(iii) accepting 9* if di^s{xi), s{x*)) <e. We use 
(pi{9) to denote the implied ABC density from which 
these samples are drawn (with <^i(^?) = ^i{9) if 
s(-) = Identity(-) and e — 0). By repeating 
(i) — (iii) we generate samples of, say, m draws, 

^iXi)' ■ ■ ■ '^i(m)' ^^'^^ 'fi{ff)- Now, suppose 

that (pi{9) is an estimate, based on the sample 
9*(^i-j, . . . , 6'*(„), of the density 'fi{9) (and hence of the 
density ^i{9)). Then the posterior density ^ can be 
estimated by 



mx)^9{9) g{9)A9, 



m = ^{9f^-u\[uo) 



(4) 



\i=2 



The steps of PW-ABC are summarised in Algorithm 
3. 

Algorithm 3 Piece-Wise Approximate Bayesian 

Computation (PW-ABC) 
for z = 2 to n do 

a: Apply the ABC Algorithm to draw m approx- 
imate (or exact, if s(-) = Identity(-) and e = 0) 
samples, 9*^^^y. . . , 9*^^^-^, from (p,{9); 



b: Using the samples 9*,^ 



(1)' 



■i(m) 



and either 



(3) 



Q or (12), calculate a density estimate, (pi{9), 

of ipm- 

end for 

Substitute the density estimates (pi{9) into ^ to 
calculate an estimate, Tr{9\X), of 7r{9\X). 

On the question of how to calculate the density es- 
timates, (pi{9), below we discuss two approaches: (i) 
using a Gaussian approximation, and (ii) using a ker- 
nel density estimate. Henceforth, quantities based on 

(i) are denoted by superscript g, and those based on 

(ii) are denoted by superscript k. In both cases we 
discuss the behaviour of the estimators in the asymp- 
totic regime in which the number of observations, n, 
is kept fixed while the size of each ABC sample in- 
creases, m — >■ oo. 

2.1 Gaussian approximation for 0i{6) 

Denote the d-dimensional multivariate Gaussian den- 
sity with mean, /i, and covariance, E, by 

K{9; M, S) = (27r)-''/2 (det I])"!/^ x 

exp(^-l{9-fifj:-'{9-fi)y (5) 
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A Gaussian approximation for (pi{0) is 

^f{e)^K{9;9lQ.), (6) 

where 



Q^ 



^ m 



are the sample mean and sample covariance of the 
ABC posterior sample 0*^^^ , . . . , ^*(„j) ■ A consequence 
of using Q is that the product of the density approx- 
imations is also Gaussian: 



(7) 



where 



B=\^Qi'j > (8) 

« = ^(EQr'e":j , (9) 

n 

w = det(27rB)i/2[|det(27rQ,)-i/2x 

n n 

n n^^p - 'eifRsM - ei)) , (lo) 



s=2 t>s 



(11) 



We note the following properties of approxima- 
tion If the densities ifi{0) from which the 
d*(^ip ■ ■ • , ^i(m) are drawn are Gaussian, i.e., ipi{9) — 
K{9; fii, Ei), then 9* and Qi are unbiased and consis- 
tent estimators of /i,; and E^ , respectively, and hence 
a and i? are consistent estimators of the true mean 
and covariance of H'^iC^)- More generally, for 'fi{9) 
which is not necessarily Gaussian, 9* and Qi are 



consistent estimators of the mean and the variance 
of the Gaussian density, iff (9), which minimises the 
KuUback-Leibler divergence, 

KL(<^,(0)||(^f(0)) = J ^,(0)log {US)/^Ue))d9; 

i.e., for each i, (pf{9) is asymptotically the "optimal" 
Gaussian approximation to (pi{9). No such relevant 
optimality holds for the product of densities, how- 
ever: the (normalised) product of Gaussians, each of 
which is closest in the KL sense to (pi{9), is in general 
not the Gaussian closest to (the normalised version 
of) n '^i(^); aiid indeed it may be very substantially 
different. In other words, as m — >■ cx), a and B do not 
in general minimise 

KL(^|n^.(^)/ / (n^«w)}||w«,s) 

2.2 Kernel density estimate for 'pi{6) 



A second method we consider is to estimate each den- 
sity (pi{9) using a kernel density estimate (see for in- 
stance Silverman 1986 and Wand and Jones [1995 ) 



A kernel density estimate based on Gaussian kernel 
functions ([5| is 



\K{0-91^,yH^i), 



(12) 



where Hi is a bandwidth matrix. We follow the ap- 



proach of Fukunaga 1972 in choosing the bandwidth 



matrix such that the shape of the kernel mimics the 
shape of the sample, in particular by taking Hi to 
be proportional to the sample covariance matrix, Qi . 
Using bandwidth matrix 

-2/(d+4) ^. 



Hi = q ■ m 



(13) 



where g > is a constant not dependent on to, en- 
sures desirable behaviour as the sample size to — > oo. 
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In particular, in terms of the little-o notation (om — expansion of (16); see Wand and Jones 1995 pill 
o{bm) as m — > oo denotes limm_j.oo \o-m/bm\ — 0) and 
with E denoting expectation, using choice of band- 



width (131, subject to mild regularity conditions on 



E {^^{ef} ^ ^,{er + o{i) 



(14) 
(15) 



Analogous calculations are rather more involved in 
the product case, however: even with the assumption 
that each (^^(6') is Gaussian, no closed expression for 
q is possible. Hence, in the examples in the following 
section, fjsj we opted to tune q in the heuristic way 
described by Wand and Jones 1995 , starting with a 



From ([l4])-([l5]), the bias, h{ip}{e)} = E {ip}{9)} - 
ipi{9), the variance, var{(^^(6')} = E {(p^{d)'^} - 
E ^(p^{9)^^, and the mean integrated squared error, 

MISE{^^} ^eJ {^^(9) - f>mYd9, (16) 

are all o(l). These results generalise routinely to 
the case of a product of n kernel density estimates, 
that is, in which Yl is used as an estimator for 

Y\'f>i{9)- It follows that since the 6**^-) are indepen- 
dent for all then, using ([l4|)-(|T5l) , 



large q (ten times that in ( 17)) then reducing it until 
"random" fluctuations begin to appear in the density 
estimates. 

A consequence of using Gaussian kernel functions 
([5]) in ( 12 ) is that the product of the density approx- 
imations is then itself a weighted mixture of (n— 1)™ 
Gaussians, 



7n n 
, 1=2 

"j2,.-,jn.K{9;aj^^...j^,Bj2^,, 



Hi) 



E 



E 



J, (18) 



0(1] 



b {n v'Ho)} = {n ^ ^'(^)} - n ^^(^) = 
var {n 0i{o)} = n ^ {^'(^)'} - n {^^'(^)}' 

MISe{J|(^J=} = E f (J['p^i9) -Y[^,{9)yd9 = o(f)2:-,i^' and weights Wj^.-.j., analogous to those in 

Hence, in the sense defined by the latter equation, 
the density estimator H "^K^) converges to the true 
density Il'^i(^) as m — > 00. 

Regarding the choice of q in ( 13 ) , in certain settings 
it is possible to determine an optimal value. Suppose 
that the true density <fi{9) is Gaussian and let 0^{9) 



Jere expressions for the covariances Bj 



(10), are given in the Appendix. 



2.3 Estimating the posterior density 



in (12) be a kernel density estimate of (pi{9). Then 
g = {(d + 2)/4}-2/('i+4) (17) 



is optimal in the sense that ( 13 ) is then an unbiased 
and consistent estimator of the bandwidth that min- 
imises the leading term of the large-m asymptotic 



Sections p.l| and §2.2| describe methods for com- 
puting the factor Y\ ipi (9) in ^ . For calculating 
an estimate of the full posterior, n{9\X) in ([3|, we 
must multiply n'^i(^) by 7r(0)(^~"^ and normalise. 
Let us suppose that the prior is Gaussian, n{9) = 
i4r(6'; /ipri, Spri). For the case where we are using 
the Gaussian approximation, ipf{9) from for each 
(pi{9), then the posterior is 



TT^9\X) = K{9;fip,,t,^post), 



(19) 
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where ^ leads to 

Spost= ((2-n)I]-,i+S-i)"\ (20) 
/Zpost = Spost ((2 - ?^)Spri /ipri + B"^a^ , (21) 



-1/2 . Mety]„„„.^i/2 . Metr5^y]„,.nV"/2-i)x 



w ■ (detB)-'^' ■ (detSpost)'^' • (det(27rSp,i))^ 



and a and B are as defined in 0. |-l(a - ^p^)^ ((2 - ny^j:^,, + B)'' {a - /^p,i)|, 

If instead we use tfie kernel approximation, (p^{6) 



from (12), for each ipi{0), then the posterior density 
is 



(25) 



whereas using the kernel approximation (12 1 gives 



m \i=2 / j2,---,jn 

E (22) 



where expressions for B'^^ j^, '^h,---,]^ ^^'^ ^'j2,---,jr, 
are in the Appendix. 



2.5 Practical numerical calculations 
for the kernel approximation 



2.4 Estimating the marginal likeli- 



Since expressions (18), (22), (26) for the kernel case 
involve sums with {n — 1)™ terms, these expressions 
are largely of academic interest and are typically not 
hood suitable for practical calculations. For the examples 

in this paper we used a more direct numerical ap- 
proach, first writing ^ as 



In some applications, especially when model compar- 
ison is of interest, it is useful to compute the marginal 
likelihood of the data given the model. The marginal / n \ 

likelihood is 5(^) = exp ^ h,{e) 7r(6i), 



tt{X) - I TT{X\e)T:{e)de (23) ^^^^^ ^^^^^ ^ {^^{0)/T:{e)), and then evaluating 

\ /• / " \ hi{9), n{9) and hence g{9) pointwise on a fine lattice. 

Y\ci] / \T](p^(9)]^T(0)^-''d0. (24) r ■ 1 1 • 1 

-LJ- / Performing calculations m this way on the log scale 



\i=2 



avoids underflow errors and improves numerical sta- 
The unknown Ci can be estimated by Cj = m/Mi, bility compared with trying to evaluate (j4]) directly, 
where Mi equals the number of ABC draws necessary As a further check for robustness, we varied the lat- 
in the ith interval to achieve m acceptances. For the tice position and resolution to make sure the results 



integral in (24), using the Gaussian approximation were insensitive to the particular choices. 
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2.6 Sampling from the posterior dis- 
tribution 

In some circumstances it may be desirable to draw 
samples from the approximate posterior density. In 



the Gaussian case, drawing from ( 19 1 is straightfor- 



ward. For the kernel case, (22), in principle sampling 



can be achieved by normalising the weights, ran- 
domly choosing a component with probability equal 
to these normalised weights, then sampling from the 
selected Gaussian component. But in practise, again. 



the large number of terms in ( 22 1 can preclude this 



approach. Other possibilities include using a Gibbs 
sampler, or sampling approximately using Gaussian 
mixtures with fewer components; see |Sudderth et al.| 



2003 . 



3 Some examples 

In this section we test PW-ABC on synthetic data 
from three models. The first, as a toy illustrative 
example, involves inferring from IID data the prob- 
ability of success in a binomial experiment. Second, 
we consider an integer- valued time series model called 
INAR(l), a model for which the likelihood is avail- 
able (albeit awkward to compute) and enables com- 
parison of our approach with the "gold standard" 
MCMC approach. Third, we consider a stochastic 
Lotka-Volterra model, a simple example from a com- 
mon class of models (which occur, for instance, in 
modelling stochastic chemical kinetics) in which the 
likelihood, and therefore many standard methods of 
inference, are unavailable. The datasets for each ex- 
ample are given in the Supplementary Material. 



3.1 Binomial model 

For this toy example we suppose the data is the set 
X = {xi, . . . ,xio} of n = 10 observations from the 
model Xi ^ Binom(fcj = 100, p = 0.6). We work 
in terms of the transformed parameter d = logit(p), 
using a prior 7r(6') ~ iV(0, 3^). For this model the 
data are IID, so that ■K{xi\xi^\,&) — 'K{xi\&). Exact 
samples from can be obtained by sampling Q* 

from the prior, sampling X* ^ Binom(100, 0*), and 
then accepting d* if and only if X* = Xi . We follow 
the PW-ABC approach described in Section [2j draw- 
ing TO — 5000 samples from each ipi{9), using these 
samples to construct Gaussian (^f (0) and kernel den- 
sity 'p\{0) approximations, then using these density 
approximations to construct approximate posterior 
densities, TT^{e\X) and T:^{e\X). Figure[l] shows that 
the approximate posterior densities are very close to 
the true posterior density for this example. The true 
log marginal likelihood, log7r(A'), computed by di- 
rect numerical integration of ( [23| , is —31.39; using 
approximation '{'^{0) and (25) gives —31.44; and us- 
ing approximation (p\{9) and numerical integration 



of the left-hand side of (26) gives —31.48 



3.2 An integer-valued autoregressive 
model 

Integer-valued time series arise in contexts such 



as modelling monthly traffic fatalities Neal and 



Subba Rao 2007 or the number of patients in a hos- 



pital at a sequence of time points Moria et al. 2011 



Consider the following integer-valued autoregressive 
model of order p, known as INAR(p) : 



Xt ^y^^ajo Xt 



t e Z, 



(27) 



4=1 
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Figure 1: Results for the binomial model in ^3.1 
Shown are the true posterior density, tt(0\X), the pos- 
terior density approximations ■Tr^{9\X) and Tt^{6\X), 
the prior, and the true 9. 

where Zt for t > 1 arc independent and iden- 
tically distributed integer-valued random variables 
with E[Zf] < oo, with the assumed to be inde- 
pendent of the Xt. Here we assume Zt ~ Po{X). 
Each operator a^o denotes binomial thinning defined 

by 



ai oW — 



Binomial(Vr, a^), W > 0, 
0, = 0, 



(28) 



for non-negative integer-valued random variable W. 
The operators a^o, i — are assumed to be 

independent. 

We consider the simplest example of this model, 



INAR(l) [see, for example, Al-Osh and Alzaid 



1987| , supposing that we have some observed data 
X = {xi, . . . , Xn} from this model and wish to make 
inference for the parameters (a, A). We generated 
n = 100 observations from an INAR(l) process us- 
ing parameters (a. A) = (0.7,1) and X{0) = 10; 




Figure 2: The realisation of an INAR(l) process used 
in the example of j ]3.2[ of length n — 100, generated 
using a — 0.7 and A — 1.0. 

the realisation is plotted in Figure [2] Working in 
terms of the transformed parameter, 9 = {9 1^62) = 
(logit(Q!), log(A)), we used a prior of Norm(0, 3^) for 
each of 9i and 92- For the EEC algorithm, the prob- 
ability of acceptance is around 10~^°", which is pro- 
hibitively small; even the ABC algorithm requires a 
value of e so large that sequential approaches (e.g., 
SMC-ABC) are needed. 

Using PW-ABC with s(-) = Identity(-) and e 
we were able to draw exact samples from ^Pi{9) for 
all of the i — 2, . . . , 100 factors, and still achieve ac- 
ceptance rates of around 9%, on average. Figure [3] 
shows an estimate of the posterior density, 'n{9\X) 
based on a gold-standard MCMC approach, together 
with Gaussian- and kernel-based PW-ABC approx- 
imations, TT^{9\X) and 7r''(6'|A'), with m = 10,000 
samples for each (pi{9). The Figure shows good agree- 
ment between the MCMC posterior and the kernel 
approximation, 'n:^{9\X), but poor agreement with 
the Gaussian approximation tt^(9\X). The poor per- 
formance of Tt^{9\X) is probably caused by several 
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Figure 3: Results for the INAR(l) example of p72 
Shown are an MCMC approximation to the posterior 
density, tt(9\X), the posterior density approximations 
TrS{e\X) and 7:^{9\X), the prior, and the true 6. The 
numbers on the contours denote the probability mass 
that they contain. 

of the densities ipi{0) being substantially different 
from Gaussian; see Figure [4] which shows <^fo(^) and 
ip^o{d) — the latter suggests that the density is far 
from Gaussian. Using Gaussian approximations to 
non-Gaussian ipi{0) appears to have a strong impact 
on the accuracy of approximation tt^{9\X), even as 
in this case where the true posterior Tr{6\X) is rea- 
sonably close to a Gaussian (cf. Fig. |3]). 

For this example, estimates of the marginal likeli- 
hood, log 7r(A'), are as follows: by direct numerical in- 



tegration of (23), —161.1; using approximation <^f (0) 
and (25), —185.7; and by using (p^{9) and numerical 



integration of the left-hand side of (26), —163.2. 

We have used p = 1 for this example so 
that the likelihood is available, enabling compar- 
ison with MCMG and calculation of the true 
marginal likelihood. However, we stress that PW- 



-0.95 




Figure 4: For the INAR(l) example, an example 
of a factor with a "non-Gaussian" density: here 
'^50 (^) ^^'i "^50 (^) substantially different from 
each other. 

ABC can be applied equally easily for p > 1, 
a case for which the likelihood is essentially in- 
tractable and therefore one has to resort to ei- 
ther exact but less direct methods (such the 
Expectation— Maximization (EM) algorithm or data- 
augmented MCMC, both of which involve treating 
the terms o Xt^i and Zf as missing data) or meth- 
ods of approximate inference, such as conditional 
least squares which involves minimizing '^f{Xt — 



E[Xt\Xt-i]) , see, for example, McKenzie 2003 and 
references therein. 

3.3 Stochastic Lotka— Volterra Dy- 
namics 

The stochastic Lotka- Volterra (LV) model is 
a model of predator-prey dynamics and an 
example of a stochastic discrete-state-space 
continuous-time Markov process [see, for ex- 
Models of this type 



ample, Wilkinson 2011 
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commonly arise when modelling chemical kinetics. 
Predator-prey dynamics can be thought of in 
chemical kinetics terms: the predators and prey 
are two populations of "reactants" subject to three 
"reactions" , namely prey birth, predation and 
predator death. Exact simulation of such models 
is straightforward, e.g., using the algorithm of 
Gillespie (1977). Inference is simple if the type and 
precise time of each reaction is observed. However, 
a more common setting is where the population 
sizes are only observed at discrete time points. In 
this case the number of reactions that have taken 
place is unknown and therefore the likelihood is not 
available and hence inference is much more difhcult. 
Reversible-jump MCMC has been developed in this 
context [Boys et aL 2008| but it requires substantial 
expertise and input from the user to implement. 



Particle MCMC (pMCMC) methods Andrieu et al 



2010 



which provide an approximation to the 
likelihood via a Sequential Monte Carlo (SMC) 
algorithm within an MCMC algorithm, have recently 
been proposed for stochastic chemical kinetics 
models Golightly and Wilkinson 2011 . Although 



being computationally intensive, such methods can 
work reliably provided the process is observed with 
measurement error. The R package smfsb, which 



accompanies Wilkinson 2011 , contains a pMCMC 



implementation designed for stochastic chemical 
kinetics models, and we use this package to compare 
results for PW-ABC and pMCMC for the following 
example. 

Let Yi and Y2 denote the number of prey and 
predators respectively, and suppose Yi and Y2 are 
subject to the following reactions 



which respectively represent prey birth, predation 
and predator death. We consider the problem of 
making inference about the rates (ri, r2, T3) based on 
observations of Yi and Y2 made at fixed intervals. 
We generated a realisation from the stochastic 



LV example of [Wilkinson [2011[ page 208], that is 



model ([29]) using (ri,r2,r3) = (1,0.005,0.6), Yi(0) = 
50 and 1^2(0) — 100. We performed inference in 
terms of transformed parameters, 9 = {01,62,03) = 
(log ri, log r2, log 7^3), this time with priors 7r(0i) ~ 
Norm(log(0.7),0.5), 7r(6'2) - Norm(log(0.005), 0.5), 
and 7r(6'3) ^ Norm(log(0.3), 0.5). We again applied 
PW-ABC using s(-) = Identity(-) and e = 0, in other 
words requiring an exact match between the observed 
and the simulated observations, to draw samples of 
size m — 10, 000 for each ipi{0). 

To obtain pMCMC results we found it necessary 
to assume an error model for the observations, hence 
we assumed errors to be IID Gaussian with mean 
zero and standard deviation equal to 2. Results are 
displayed in Figure [3?3l which shows plots for univari- 
ate and pairwise bivariate marginal posterior densi- 
ties for the pMCMC results, and for the PW-ABC 
approximations, 7rS{0\X) and Tt^{0\X). Both of the 
PW-ABC approximations agree well with each other 
and with the pMCMC results for this example. 

4 Conclusion and Discussion 



^1^2^!, Yi+Y2^2Y2, 



PW-ABC aims at using Markov structure to lower 
the computational cost involved in ABC. The major 
advantage of PW-ABC over ABC is that for a given 
summary statistic, s(-), and tolerance, e, acceptance 
rates for sampling from ipi{0) are overwhelmingly 
Y2 ^ 0, (29) higher in comparison to sampling from n{0\X). Po- 



ll 



tentially PW-ABC enables the choice s = Identity(-), 
sidestepping the difficult task of deciding what sum- 
mary statistic is appropriate (i.e., "approximately 
sufficient" ) and eliminating ensuing biases that result 
from the choice. 

Having sampled from each ipi{0) the question then 
becomes how to estimate Tr{6\X) using these samples. 
PW-ABC works by constructing density approxima- 
tions (pi{0) to each (pi{0). The approach of taking 
ipi{0) to be Gaussian, with moments matched to the 
sample moments, is computationally cheap, and if 
the prior is also taken to be Gaussian then there 
is a closed form expression for the Gaussian poste- 
rior density and marginal likelihood, making calcula- 
tions extremely fast. Taking ipi{0) to be Gaussian is 
perhaps adequate in many applications: performance 
was strong in two of the three examples we consid- 
ered. The poor performance in the INAR example 



of ^3.2 is probably because some of the (piiO) are so 
different from Gaussian. It is striking to see an ef- 
fect so strong when the true posterior is so close to 
Gaussian. Unfortunately, increasing the number, m, 
of AEG samples is no remedy to this problem: as 
m — )■ oo, the product of Gaussians, itself a Gaussian, 
in general does not converge to the Gaussian closest 
in the KuUback-Leibler sense to the target density. 

In terms of asymptotic performance, using the 
kernel approximation, <p}{9), for (pi{9) is preferable 
since, in this case, the estimated posterior density 
converges to the target as to — >■ oo. The kernel ap- 
proach is computationally more demanding, however, 
and also calls for a heuristic choice of a scalar smooth- 
ing parameter. In spite of its asymptotic properties, 
its practical use is probably limited to problems in 
which 9 has small dimension. 



A more general possibility which we will explore 
in future work is to let ipi{9) be a mixture of, say, u 
Gaussians. This encompasses ^ and (12) as special 
cases (with u — 1 and u = m respectively). For a 
general mixture model for (pi{9), each of the compo- 
nent Gaussians is parameterised by a scalar weight, 
a mean vector and a covariance matrix which need to 
be determined. We would envisage regularising, e.g., 
by setting each covariance to be equal up to scalar 



multiplication, perhaps as for (12 1 taking the covari- 



ance equal to the sample covariance, and then fitting 
each (^i(0) based on the samples from ipi{9) using, 
say, an EM algorithm. This approach is a compro- 
mise between (|6| and (12). It does not share the 
property of ( [l2| that estimated densities converge to 
the true densities as to — >■ cx), but on the other hand 
it is computationally much less involved and offers 
much extra freedom and flexibility over ([6]), partic- 
ularly for dealing with multimodal densities. If u 
is taken sufficiently small then it may be feasible to 
work explicitly with the (n— l)"-term resulting Gaus- 
sian mixture, Jl'i^*(^)' enabling explicit calculations 
involving the posterior density, such as computing 



the marginal likelihood, analogous to (25), and direct 



sampling from the approximate posterior density (see 



S2.6) 



Several further generalisations of the PW-ABC 
approach are possible. In ([!]), each of the n — 
1 factors n{xi\xi^i,9)^ i — 2,...,n is the like- 
lihood for a single data point conditional on the 
previous. An alternative possibility is to factorise 
the likelihood into fewer factors, with each corre- 
sponding to a "block" of multiple observations, e.g., 
Tr{xi^y. , XiJf-y.^i, . . . , Xi\xi-i, 9) for some choice of Ui, 
and the factorised likelihood becomes a product over 
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the relevant subset of i = 2, . . . , n. To an extent this 
potentially reintroduces difficulties that with PW- 
ABC we sought to avoid, namely lower acceptance 
rates leading to a possible need to use a summary 
statistic and non-zero tolerance (and the ensuing 
ABC error they bring). On the other hand, we might 
expect, owing to the central limit theorem, that a 
factor ^i{9) which depends on several data points 
will be closer to Gaussian than a factor dependent 
on only a single data point, and hence that ([6]) and 



(12) (especially the former) will perform better. 

If using larger "blocks" of data in the factorisation 
makes it necessary to use a non-zero tolerance e > 
(or if e > is necessary even when using a single 
observation per factor) then there are theoretical ad- 



vantages to using what Fearnhead and Prangle 2012 



call "noisy ABC" . In the context of this paper, noisy 
ABC would involve replacing the summary statistic 
s(-) with a random version s'(-) — s(-) -I- a, where 
a is a realisation of a random variable with density 
uniform on the ball {x\d{s{X),x) < e}. Using noisy 
ABC ensures that, under mild regularity conditions, 
as n — 7- oo, the posterior converges to a point mass at 
the true parameter value; see §2.2 of [Fearnhead and| 



Prangle 2012 . 



Recently, we have learnt of an interesting paper by 



the iih factor by sampling from the prior, EP-ABC 
draws samples from an iteratively updated pseudo- 
prior. The pseudo-prior is a Gaussian approximation 
to the component of the posterior that involves all the 
data except those pertaining to the ith factor. The 
use of the pseudo-prior offers a high acceptance rate 
in the ABC sampling and so EP-ABC can potentially 
lead to an extremely fast approximation to the full 
posterior Tr{d\X). A disadvantage is that conditions 
sufficient for the convergence of EP-ABC (or even 
the simpler deterministic EP) are not known. Also, 
as with using PW-ABC with 0, since EP-ABC uses 
a Gaussian approximation for each factor, it is po- 
tentially ill-suited to problems with complicated (e.g. 
multimodal or otherwise non-Gaussian) likelihoods; 
convergence of the product density is not assured to 
any "optimal" approximation to the target posterior. 
A promising direction for future work will be to inves- 
tigate adapting the EP-ABC idea of sampling from a 
pseudo-prior to the ideas in this paper of using ker- 
nel (or Gaussian mixture) density estimates for each 
likelihood factor. 



Barthelme and Chopin 2011 who have developed an Acknowledgements 



approach termed Expectation Propagation- ABC (EP- 
ABC) that shares similarities with ours. EP-ABC is 
an ABC adaptation of the Expectation Propagation 
approach developed by Minka 20011. EP-ABC uses 



essentially the same factorisation as ([2| and makes a 
Gaussian approximation to the density of each factor 
analogous to ([6|. But then EP-ABC proceeds rather 
differently: instead of drawing ABC samples for, say, 
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Appendix 

Expression for i?j2.- -,in j o,j2.---,jn i ^^'^ '^32 
analogous to ([8|-(10), are as follows: 



in (18), 




w 



02, 



m(i-") det(27rB 



32,---,3n' 



xl/2 



J|det(27ri/i 



-1/2 



i=2 



nn-p 



s=2 t>s 

Rst = 



32, 



-¥^^*s(3. 



^*t{3t)fRs 



Expressions for B', 



32,---,3n ' 



32, 



and w' 



J2, 



( 22 1 are given respectively by the right-hand sides 



of Q, ([21]), and ([25]) with B replaced by 5^,,...^,,, 
a replaced by Qja- ■ jn' ^^"^ replaced by Wj^,,,,^^. 
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Figure 5: Results for the Lotka-Volterra example of |3.3[ showing univariate and bivariate marginal posterior 
densities of 9 based on a posterior sample from a pMCMC algorithm, and from the Gaussian- and kernel- 
based PW-ABC approximations, t:^{0\X) and 7r'^(6'|A'). For the kernel approximation we used g = 5 as the 
smoothing parameter in (13). The contours shown in the bivariate plots are those that contain 5%, 10%, 
50%, 90% and 95% of probability mass. 



